Introduction

The DDD-GUI is used to perform differential expression (DE) gene and transcript, differental alternaive spliced (DAS) gene and differential transcript usage (DTU) transcript analysis. To use our pipeline in your work, please cite:

Calixto,C.P.G., Guo,W., James,A.B., Tzioutziou,N.A., Entizne,J.C., Panter,P.E., Knight,H., Nimmo,H., Zhang,R., and Brown,J.W.S. (2018) Rapid and dynamic alternative splicing impacts the Arabidopsis cold response transcriptome. Plant Cell.

Installation

Required R packages: shiny, shinydashboard, shinyFiles, tximport, edgeR, plotly, RUVSeq, eulerr, gridExtra, grid, ComplexHeatmap, Gmisc, limma

Prepare input data

The following datasets are required in the pipeline (Figure 1):

Note: the analysis will read information according to the column names in the spreadsheets. Please use the same column names as in examples.

Figure 1: User provided datasets.

Output data

Four folders will be created in the working directory to save the analysis results:

DDD-GUI

The GUI includes 7 pages in the sidebar menu: Introduction, Data generation, Data pre-processing, DE DAS and DTU analysis, Result summary, Advanced plot and Generate report.

Data generation

  1. To begin with, users can choose a working directory to save the output data, results and figures.
Figure 2: Choose working directory
  1. The whole pipeline has a number of steps and each step is related to several datasets. This table is the summary of datasets required for the analysis. To generate/upload the datasets, the following options are provided:

    • If the datasets (in .RData format) do not exist in the “data” folder of working directory, users can generate these datasets step-by-step.
    • If all the required datasets are generated and saved in the “data” folder, users can upload by clicking “Load all” button.
    • To perform analysis or visualise results in a specific step, users also can unload the required datasets for individual steps.

Figure 3: Data information for the analysis. The columns:

  • Data – the .RData file names in the “data” folder.
  • In data folder – logical, whether the required datasets exist in the “data” folder.
  • In workspace – logical, whether the required datasets have been uploaded for analysis.
  • Description – description of the datasets.
  • Generate from – the steps to generate the required datasets.
  • Use in – the steps to use the datasets.

  1. Generate gene and transcript expression. The csv files “mapping.csv” and “samples.csv” (see Figure 1A and B) are uploaded (Figure 4A) and visualised (Figure 4C) in this interface. The tximport R package (Soneson, Love, and Robinson 2016) is used to generate gene and transcript expression of read counts and TPMs (transcript per million reads) based on three options: “no” (default), “scaledTPM” and “lengthScaledTPM”. The “lengthScaledTPM” is recommended since it scales the expression using the transcript length and then the library size over samples (see the tximport documentation for details). The generated gene and transcript expression will be save as R data object txi_genes.RData and txi_trans.RData in the “data” folder, respectively. Spreadsheets of read counts and TPMs will be saved as csv files in the “result” folder.

Figure 4: Generate gene and transcript expression.

Data pre-processing

Once the read counts and TPMs are generated, the data is pre-processed with steps: Merge sequencing replicates (if exist), Remove low expressed transcripts/genes, Estimate batch effects (if exist) and Data normalisation. In each step, quality control plots are generated to optimise the paramters for pre-processing (Figure 5).

Figure 5: Data pre-processing. The figures were generated based on the RNA-seq data from study of Arabidopsis in response to cold (Calixto et al. 2018)

Merge sequencing replicates

The sequencing replicates are generated by sequencing the same tissue several times, which do not add too much information to the technical variation. Sequencing replicates from the same biological replicate are often added to increase the depth. In the panel of Figure 6A, the sequencing replicate information will be extracted from the “srep” column of the “samples.csv” file ( Figure 1B) to indicate whether to merge the sequencing replicates.

Filter low expressed genes/transcripts

The low expressed transcripts are filtered based on count per million (CPM) reads: \[ CPM=\frac{X_i}{N}\times 10^6 \] where \(X_i\) is read count of transcript \(i\) and \(N\) is the library size of the sample.

  • An expressed transcript must have \(\geq n\) samples with expression \(\geq m\) CPM.
  • An expressed gene must have at least one expressed transcript.

Note: The sample number cut-off \(n\) and CPM cut-off \(m\) are determined by the mean-variance trend plot (Figure 6B and Figure 6C). Read counts are assumed to be negative binomial distributed of which the mean and variance has a relation: \[ Var(\log_2X)=\frac{1}{\mu}+\phi \] where \(X\) is read count, \(\mu\) is the mean and \(\phi\) is the overdispersion. The expression variance decreasing monotonically with the increasing of the mean. In real RNA-seq data, the low expressed transcripts confound with noise and the expression distribution is different to the expressed transcripts. In the mean-variance trend plot, the low expressed transcripts cause a drop trend in low expression region (Figure 5 and Figure 6C). Therefore, the cut-offs \(n\) and \(m\) to filter the low expression can be optimised until the drop trend disappeared in the meam-variance trend plot.

Figure 6: Merge sequencing replicates and filter low expressed genes/transcripts.

Principal component analysis (PCA) and batch effect estimation

PCA is a mathematical method to reduce expression data dimensionality while remaining most of the data variance. The reduction is performed by projecting the data to directions or principal components(PCs) of highest data varablity. For example. Therefore, the data main variance is accessible by investigate a few number of PCs rather than thousands of variables. In omic data analysis, the first two to four principal componets are often used to visualise the similarities and differences of samples, thereby determing the groups of samples. In the PCA scatter plot, the samples from the same condition often stay close, but far from the samples of other conditions. It can be used as evidence for data quality checking.

PCA plot also can be used to identify whether the RNA-seq data contain batch effects, which are caused by biological replications in different laboratory conditions. Compared with random noise, batch effects can be distinguished owning to the systematic biases of signals across replicates. For example, Figure 7A shows the PCA plot (PC1 vs PC2) of a RNA-seq data with 3 time-points T2, T10 and T19. Each time-point has 3 biological replicates. It can be seen that the first bioloigcal replicate (brep1) stay in a seperate cluster to other two replicates, which indicates a clear batch effect of the data. In this pipeline, the RUVSeq R package (Risso et al. 2014) is used to estimate the batch effects. The RUVSeq algorithm generates batch effect terms which can be incorporated in the design matrix of linear regression model for DE and DAS analysis, i.e.

\[ \text{observed expression = baseline effects + batch effects + noise} \]

It also generates a modified expression matrix in which the batch effects has been removed from the data. To avoid data over-modification, this dataset is only used to make PCA plot, but not used for downstream DE and DAS analysis (Figure 7B).

In this panel (Figure 7), users can select and visualise different PCs based on transcript level or gene level exprssion, or the average expression of biological replicates. The scatter points can be grouped and coloured according to biological replicates or conditions. Ellipses or polygons can be added to the plots to highlight the grouped clusters. Three options, RUVr, RUVs and RUVg, in the RUVSeq R package are provided to estimate the batch effects (see the package documentation for details). Users can “Update PCA plot” to visualise whether the batch effects are removed properly. If no clear batch effects in the PCA plot, please skip this step.

Figure 7: PCA plot before (A) and after (B) remove batch effects.

Data normalisation

To enable unbiased comparisons, read counts must be normalised on the basis of sequencing depth across samples. Normalisation methods Trimmed Mean of M-values (TMM), Relative Log Expression (RLE) and upper-quartile are provided to reduce the effects from the systematic technical biases across samples (M. D. Robinson and Oshlack 2010; Bullard et al. 2010). Boxplots are used to visualise the expression distribution across samples before and after normalisation.

Figure 8: Data normalisation.

DE, DAS and DTU analysis

Definition of DE, DAS and DTU in this pipeline

Statistics:

  • Adjusted p-value: in the expression comparitive analysis, each gene/transcript is fitted with a statistical model, reporting a p-value. However the testing is performed over a substantial population of genes/transcripts and each testing has a probability of making errors. Hence, all p-values in the analsysis are adjusted by controlling the false discovery rate (FDR). Please see the R function p.adjust for details.
  • \(L_2FC\): \(\log_2\) fold change of expression based on contrast groups.
  • \(\Delta PS\): Transcript level PS (percent of splice) is defined as the ratio of transcript average abundance of conditions divided by the average gene abundance. \(\Delta PS\) is the PS differences of conditions based on the contrast groups. The abundance to calculate PS values can be TPMs or read counts.

DE gene/transcript analysis: A gene/transcript is identified as DE in a contrast group if \(L_2FC\) of expression \(\geq\) cut-off and with adjusted p-value < cut-off.

Two pipelines are provided for DE analysis: “limma-voom” and “edgeR”. The “edgeR” pipeline includes two methods: “glmQL” (Genewise Negative Binomial Generalized Linear Models with Quasi-likelihood Tests) and “glm” (Genewise Negative Binomial Generalized Linear Models). In limma pipeline, \(L_2FCs\) are calculated based count per million reads (CPMs) while in edgeR, \(L_2FCs\) are based on read counts. From several real RNA-seq data analyses, high consensus is achieved between “limma-voom” and “glmQL” (>90%) and they have more stringent controls of false discovery rate than the “glm” method.

DAS gene and DTU transcript analysis: To test differential expression, the gene level expression is compared to transcript level expression in the contrast groups. A gene is DAS in a contrast group if adjusted p-value < cut-off and at least one transcript of the gene with \(\Delta PS \geq\) cut-off. A transcript is DTU if adjusted p-value < cut-off and \(\Delta PS \geq\) cut-off.

Two pipelines for expression comparative analysis:

  • limma pipeline: To identify DAS genes, the expression of each transcript is compared to the weighted average expression of all the transcripts for the same gene (weight on transcript expression variance; the average expression can be treated as gene level expression). P-value of each test is converted to genewise p-value by using F-test or Simes method. To identify DTU transcript, each transcript is compared to the weighted average of all the other transcripts for the same gene. See the diffSplice function in limma package for details.
  • edgeR pipeline: To identify DTU transcripts, log-fold-change of each transcript is compared to log-fold-change of entire gene (sum of all transcripts). The DTU transcript test p-values are summarised to genewise p-value by using F-test or Simes method to report DAS genes. See diffSpliceDGE function in “edgeR” R package for details.

In the GUI, contrast groups in csv file (Figure 1D) can be uploaded in panel Figure 9A. DE, DAS and DTU analysis pipeline, \(\Delta PS\) and cut-offs of test statistics can be selected/generated in panels Figure 9B and C. Subset of \(\Delta PS\) can be visualised in panel Figure 9D.

Figure 9: DE and DAS analysis.

Result summary

This page includes panels to show the DE, DAS and DTU analysis results (Figure 10-12):

Figure 10: Number of DE genes, DAS genes, DE transcripts and DTU transcripts, and bar plots of up-down regulation numbers.

Figure 11: Euler diagram of numbers between contrast groups (A) and numbers to compare DE vs DAS genes and DT vs DTU transcripts (B).

Figure 12: Flowchart to show numbers of DE and/or DAS genes and DE and/or DTU transcripts.

Advanced plot

Heatmap

Users can make heatmap for DE genes, DAS genes, DE transcripts and DTU transcripts identified in the analysis (Figure 13A). Users can also upload a gene or transcript list in csv file to make heatmap for specific list (Figure 13B). To make the heatmap, the average TPMs of conditions for the targets are standardized into z-scores. The dist and hclust R functions are used to perform the hierarchical clustering analysis. The heamaps and the target list in each cluster of the heatmaps can be saved to local folder.

Figure 13: Heatmap panel (A) and example user provided gene list in csv file for heatmap (B).

Profile plot

Gene and transcript expression profiles (TPMs or read counts) and PS (percent spliced) can be visualised by typing a gene name (Figure 14A). Users can also generate a number of plots by provides a gene list (Figure 13B) and all the plots will be saved to “figure” folder in the working directory (Figure 14B).

Figure 14: Plot of expression profiles and PS across conditions

GO annotation plot

Users can generate DE and DAS gene list by click “Generate” button (Figure 15B). These gene lists can be uploaded to Gene Ontology (GO) analysis tools (e.g. DAVID and agriGO) to generate GO annotation. To visualise the annotation plot, a csv file need to be uploaded to the GUI (Figure 15B). The file includes a “Category” column of CC (cellular component), BP (biological process) and MF (molecular function), a column of “Term” of GO annotation and columns of statistics to report the annotation enrichment, e.g. count, FDR, -log10(FDR), etc. (Figure 15A). In the panel, users can select different statistics to visulise. The selected gene list type will present in the plot title to distinguish whether the provided gene list is DE or DAS genes.

Figure 15: GO annotation plot

Generate report

The table in this page lists the location of save files and the parameters for the analysis. If the information is wroing, users can double click the cells in the table to correct to values. Based on this information, report in three format, word, pdf and html, will be generated and saved in the “report” folder of working directory.

Figure 16: Generate report

Session information

## R version 3.5.1 (2018-07-02)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 7 x64 (build 7601) Service Pack 1
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United Kingdom.1252 
## [2] LC_CTYPE=English_United Kingdom.1252   
## [3] LC_MONETARY=English_United Kingdom.1252
## [4] LC_NUMERIC=C                           
## [5] LC_TIME=English_United Kingdom.1252    
## 
## attached base packages:
##  [1] grid      stats4    parallel  stats     graphics  grDevices utils    
##  [8] datasets  methods   base     
## 
## other attached packages:
##  [1] Gmisc_1.6.4                 htmlTable_1.12             
##  [3] Rcpp_0.12.18                ComplexHeatmap_1.18.1      
##  [5] gridExtra_2.3               eulerr_4.1.0               
##  [7] RUVSeq_1.14.0               EDASeq_2.14.1              
##  [9] ShortRead_1.38.0            GenomicAlignments_1.16.0   
## [11] SummarizedExperiment_1.10.1 DelayedArray_0.6.6         
## [13] matrixStats_0.54.0          Rsamtools_1.32.3           
## [15] GenomicRanges_1.32.6        GenomeInfoDb_1.16.0        
## [17] Biostrings_2.48.0           XVector_0.20.0             
## [19] IRanges_2.14.11             S4Vectors_0.18.3           
## [21] BiocParallel_1.14.2         Biobase_2.40.0             
## [23] BiocGenerics_0.26.0         plotly_4.8.0.9000          
## [25] ggplot2_3.0.0               edgeR_3.22.3               
## [27] limma_3.36.3                tximport_1.9.12            
## [29] shinyFiles_0.7.1            shinydashboard_0.7.0       
## [31] shiny_1.1.0                
## 
## loaded via a namespace (and not attached):
##  [1] colorspace_1.3-2       rjson_0.2.20           hwriter_1.3.2         
##  [4] rprojroot_1.3-2        circlize_0.4.4         base64enc_0.1-3       
##  [7] GlobalOptions_0.1.0    fs_1.2.6               rstudioapi_0.7        
## [10] bit64_0.9-7            AnnotationDbi_1.42.1   splines_3.5.1         
## [13] R.methodsS3_1.7.1      DESeq_1.32.0           geneplotter_1.58.0    
## [16] knitr_1.20             Formula_1.2-3          jsonlite_1.5          
## [19] annotate_1.58.0        cluster_2.0.7-1        R.oo_1.22.0           
## [22] compiler_3.5.1         httr_1.3.1             backports_1.1.2       
## [25] assertthat_0.2.0       Matrix_1.2-14          lazyeval_0.2.1        
## [28] later_0.7.4            acepack_1.4.1          htmltools_0.3.6       
## [31] prettyunits_1.0.2      tools_3.5.1            bindrcpp_0.2.2        
## [34] gtable_0.2.0           glue_1.3.0             GenomeInfoDbData_1.1.0
## [37] dplyr_0.7.6            rtracklayer_1.40.6     stringr_1.3.1         
## [40] mime_0.5               XML_3.98-1.16          zlibbioc_1.26.0       
## [43] MASS_7.3-50            scales_1.0.0           aroma.light_3.10.0    
## [46] hms_0.4.2              promises_1.0.1         RColorBrewer_1.1-2    
## [49] yaml_2.2.0             memoise_1.1.0          rpart_4.1-13          
## [52] biomaRt_2.36.1         latticeExtra_0.6-28    stringi_1.1.7         
## [55] RSQLite_2.1.1          genefilter_1.62.0      checkmate_1.8.5       
## [58] GenomicFeatures_1.32.2 shape_1.4.4            rlang_0.2.2           
## [61] pkgconfig_2.0.2        bitops_1.0-6           evaluate_0.11         
## [64] lattice_0.20-35        purrr_0.2.5            bindr_0.1.1           
## [67] htmlwidgets_1.2        bit_1.1-14             tidyselect_0.2.4      
## [70] plyr_1.8.4             magrittr_1.5           R6_2.2.2              
## [73] Hmisc_4.1-1            DBI_1.0.0              foreign_0.8-71        
## [76] pillar_1.3.0           withr_2.1.2            nnet_7.3-12           
## [79] forestplot_1.7.2       abind_1.4-5            survival_2.42-6       
## [82] RCurl_1.95-4.11        tibble_1.4.2           crayon_1.3.4          
## [85] rmarkdown_1.10         GetoptLong_0.1.7       progress_1.2.0        
## [88] locfit_1.5-9.1         data.table_1.11.6      blob_1.1.1            
## [91] digest_0.6.17          xtable_1.8-3           tidyr_0.8.1           
## [94] httpuv_1.4.5           R.utils_2.7.0          munsell_0.5.0         
## [97] viridisLite_0.3.0

References

Bullard, James H, Elizabeth Purdom, Kasper D Hansen, and Sandrine Dudoit. 2010. “Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments.” BMC Bioinformatics 11 (1): 94. doi:10.1186/1471-2105-11-94.

Calixto, Cristiane P G, Wenbin Guo, Allan B James, Nikoleta A Tzioutziou, Juan C Entizne, Paige E Panter, Heather Knight, Hugh Nimmo, Runxuan Zhang, and John W.S. Brown. 2018. “Rapid and Dynamic Alternative Splicing Impacts the Arabidopsis Cold Response Transcriptome.” The Plant Cell. American Society of Plant Biologists. doi:10.1105/tpc.18.00177.

Patro, Rob, Geet Duggal, Michael I Love, Rafael A Irizarry, and Carl Kingsford. 2017. “Salmon provides fast and bias-aware quantification of transcript expression.” Nature Methods 14 (4): 417–19. doi:10.1038/nmeth.4197.

Risso, Davide, John Ngai, Terence P. Speed, and Sandrine Dudoit. 2014. “Normalization of RNA-seq data using factor analysis of control genes or samples.” Nature Biotechnology 32 (9): 896–902. doi:10.1038/nbt.2931.

Robinson, Mark D., and Alicia Oshlack. 2010. “A scaling normalization method for differential expression analysis of RNA-seq data.” Genome Biology 11 (3): R25. doi:10.1186/gb-2010-11-3-r25.

Soneson, Charlotte, Michael I Love, and Mark D Robinson. 2016. “Differential analyses for RNA-seq: transcript-level estimates improve gene-level inferences.” Journal Article. F1000Research 4: 1521. doi:10.12688/f1000research.7563.2.